{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CS-109: Fall 2015  -- Lab 4\n",
    "\n",
    "# Regression in Python\n",
    "\n",
    "***\n",
    "This is a very quick run-through of some basic statistical concepts\n",
    "\n",
    "* Regression Models\n",
    "    * Linear, Logistic\n",
    "* Prediction using linear regression\n",
    "* Some re-sampling methods    \n",
    "    * Train-Test splits\n",
    "    * Cross Validation\n",
    "\n",
    "Linear regression is used to model and predict continuous outcomes while logistic regression is used to model binary outcomes. We'll see some examples of linear regression as well as Train-test splits.\n",
    "\n",
    "\n",
    "The packages we'll cover are: `statsmodels`, `seaborn`, and `scikit-learn`.\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img width=600 height=300 src=\"https://imgs.xkcd.com/comics/sustainable.png\"/>\n",
    "***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# special IPython command to prepare the notebook for matplotlib\n",
    "%matplotlib inline \n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy.stats as stats\n",
    "import matplotlib.pyplot as plt\n",
    "import sklearn\n",
    "import statsmodels.api as sm\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set_style(\"whitegrid\")\n",
    "sns.set_context(\"poster\")\n",
    "\n",
    "# special matplotlib argument for improved plots\n",
    "from matplotlib import rcParams\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 0: Piazza Posting Guidelines\n",
    "***\n",
    "<div class=\"span5 alert alert-info\">\n",
    "The high volume of posts on piazza has made it very difficult for us to answer all of your questions.  For this reason, we are taking measures to decrease the volume of posts, and increase their quality.  Below is the general format that we now require for all posts.  ONLY posts that are in this format will be answered by staff.  Posts that are not in this format will be made private, and students will be asked to reformat their question.\n",
    "</div>\n",
    " \n",
    "1. At the top of your post, make a list of all the keywords you entered in your piazza search when looking for answers to your question.  Also include a list of keywords you searched in google.  Provide links to all the posts that you read that were relevant, but did not quite answer your question. \n",
    " \n",
    "2. If you are sure that your question is not a duplicate question, write down your question in as much detail as possible without posting code.  You can post your error messages, and unit tests.\n",
    " \n",
    "3. Include the steps you have taken for debugging, and what the outcome was.  You must have spent at least 30 minutes trying to debug before you post a question. \n",
    " \n",
    "4. Post your question in the most specific folder possible, for example hw1-1.1 (see @564)\n",
    " \n",
    "5. Follow up!  Describe the solution that worked for you.  Also, click the \"resolve\" button if your problem has been resolved.\n",
    " \n",
    "See @310 for general posting guidelines and etiquette.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "# Part 1: Linear Regression\n",
    "### Purpose of linear regression\n",
    "***\n",
    "<div class=\"span5 alert alert-info\">\n",
    "\n",
    "<p> Given a dataset $X$ and $Y$, linear regression can be used to: </p>\n",
    "<ul>\n",
    "  <li> Build a <b>predictive model</b> to predict future values of $X_i$ without a $Y$ value.  </li>\n",
    "  <li> Model the <b>strength of the relationship</b> between each dependent variable $X_i$ and $Y$</li>\n",
    "    <ul>\n",
    "      <li> Sometimes not all $X_i$ will have a relationship with $Y$</li>\n",
    "      <li> Need to figure out which $X_i$ contributes most information to determine $Y$ </li>\n",
    "    </ul>\n",
    "   <li>Linear regression is used in so many applications that I won't warrant this with examples. It is in many cases, the first pass prediction algorithm for continuous outcomes. </li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "### A brief recap\n",
    "***\n",
    "\n",
    "[Linear Regression](http://en.wikipedia.org/wiki/Linear_regression) is a method to model the relationship between a set of independent variables $X$ (also knowns as explanatory variables, features, predictors) and a dependent variable $Y$.  This method assumes the relationship between each predictor $X$ is linearly related to the dependent variable $Y$.  \n",
    "\n",
    "$$ Y = \\beta_0 + \\beta_1 X + \\epsilon$$\n",
    "\n",
    "where $\\epsilon$ is considered as an unobservable random variable that adds noise to the linear relationship. This is the simplest form of linear regression (one variable), we'll call this the simple model. \n",
    "\n",
    "* $\\beta_0$ is the intercept of the linear model\n",
    "\n",
    "* Multiple linear regression is when you have more than one independent variable\n",
    "    * $X_1$, $X_2$, $X_3$, $\\ldots$\n",
    "\n",
    "$$ Y = \\beta_0 + \\beta_1 X_1 + \\ldots + \\beta_p X_p + \\epsilon$$ \n",
    "\n",
    "* Back to the simple model. The model in linear regression is the *conditional mean* of $Y$ given the values in $X$ is expressed a linear function.  \n",
    "\n",
    "$$ y = f(x) = E(Y | X = x)$$ \n",
    "\n",
    "![conditional mean](images/conditionalmean.png)\n",
    "http://www.learner.org/courses/againstallodds/about/glossary.html\n",
    "\n",
    "* The goal is to estimate the coefficients (e.g. $\\beta_0$ and $\\beta_1$). We represent the estimates of the coefficients with a \"hat\" on top of the letter.  \n",
    "\n",
    "$$ \\hat{\\beta}_0, \\hat{\\beta}_1 $$\n",
    "\n",
    "* Once you estimate the coefficients $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$, you can use these to predict new values of $Y$\n",
    "\n",
    "$$\\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_1$$\n",
    "\n",
    "\n",
    "* How do you estimate the coefficients? \n",
    "    * There are many ways to fit a linear regression model\n",
    "    * The method called **least squares** is one of the most common methods\n",
    "    * We will discuss least squares today\n",
    "    \n",
    "### Estimating $\\hat\\beta$: Least squares\n",
    "***\n",
    "[Least squares](http://en.wikipedia.org/wiki/Least_squares) is a method that can estimate the coefficients of a linear model by minimizing the difference between the following: \n",
    "\n",
    "$$ S = \\sum_{i=1}^N r_i = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$\n",
    "\n",
    "where $N$ is the number of observations.  \n",
    "\n",
    "* We will not go into the mathematical details, but the least squares estimates $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ minimize the sum of the squared residuals $r_i = y_i - (\\beta_0 + \\beta_1 x_i)$ in the model (i.e. makes the difference between the observed $y_i$ and linear model $\\beta_0 + \\beta_1 x_i$ as small as possible). \n",
    "\n",
    "The solution can be written in compact matrix notation as\n",
    "\n",
    "$$\\hat\\beta =  (X^T X)^{-1}X^T Y$$ \n",
    "\n",
    "We wanted to show you this in case you remember linear algebra, in order for this solution to exist we need $X^T X$ to be invertible. Of course this requires a few extra assumptions, $X$ must be full rank so that $X^T X$ is invertible, etc. **This is important for us because this means that having redundant features in our regression models will lead to poorly fitting (and unstable) models.** We'll see an implementation of this in the extra linear regression example.\n",
    "\n",
    "**Note**: The \"hat\" means it is an estimate of the coefficient.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "# Part 2: Boston Housing Data Set\n",
    "\n",
    "The [Boston Housing data set](https://archive.ics.uci.edu/ml/datasets/Housing) contains information about the housing values in suburbs of Boston.  This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository. \n",
    "\n",
    "\n",
    "## Load the Boston Housing data set from `sklearn`\n",
    "***\n",
    "\n",
    "This data set is available in the [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston) python module which is how we will access it today.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_boston\n",
    "boston = load_boston()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "boston.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "boston.data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print column names\n",
    "print boston.feature_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print description of Boston housing data set\n",
    "print boston.DESCR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's explore the data set itself. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bos = pd.DataFrame(boston.data)\n",
    "bos.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are no column names in the DataFrame. Let's add those. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bos.columns = boston.feature_names\n",
    "bos.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have a pandas DataFrame called `bos` containing all the data we want to use to predict Boston Housing prices.  Let's create a variable called `PRICE` which will contain the prices. This information is contained in the `target` data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print boston.target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bos['PRICE'] = boston.target\n",
    "bos.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EDA and Summary Statistics\n",
    "***\n",
    "\n",
    "Let's explore this data set.  First we use `describe()` to get basic summary statistics for each of the columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bos.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter plots\n",
    "***\n",
    "\n",
    "Let's look at some scatter plots for three variables: 'CRIM', 'RM' and 'PTRATIO'. \n",
    "\n",
    "What kind of relationship do you see? e.g. positive, negative?  linear? non-linear? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(bos.CRIM, bos.PRICE)\n",
    "plt.xlabel(\"Per capita crime rate by town (CRIM)\")\n",
    "plt.ylabel(\"Housing Price\")\n",
    "plt.title(\"Relationship between CRIM and Price\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(bos.RM, bos.PRICE)\n",
    "plt.xlabel(\"Average number of rooms per dwelling (RM)\")\n",
    "plt.ylabel(\"Housing Price\")\n",
    "plt.title(\"Relationship between RM and Price\")\n",
    "\n",
    "# sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can also use seaborn regplot for this\n",
    "#  This provides automatic linear regression fits (useful for data exploration later on)\n",
    "\n",
    "sns.regplot(y=\"PRICE\", x=\"RM\", data=bos, fit_reg = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(bos.PTRATIO, bos.PRICE)\n",
    "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n",
    "plt.ylabel(\"Housing Price\")\n",
    "plt.title(\"Relationship between PTRATIO and Price\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Histograms\n",
    "***\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.hist(bos.CRIM)\n",
    "plt.title(\"CRIM\")\n",
    "plt.xlabel(\"Crime rate per capita\")\n",
    "plt.ylabel(\"Frequencey\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.hist(bos.PRICE)\n",
    "plt.title('Housing Prices: $Y_i$')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Frequency')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression with  Boston housing data example\n",
    "***\n",
    "\n",
    "Here, \n",
    "\n",
    "$Y$ = boston housing prices (also called \"target\" data in python)\n",
    "\n",
    "and\n",
    "\n",
    "$X$ = all the other features (or independent variables)\n",
    "\n",
    "which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method as the way to estimate the coefficients.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use two ways of fitting a linear regression. We recommend the first but the second is also powerful in its features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fitting Linear Regression using `statsmodels`\n",
    "***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import regression modules\n",
    "# ols - stands for Ordinary least squares, we'll use this\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.formula.api import ols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# statsmodels works nicely with pandas dataframes\n",
    "# The thing inside the \"quotes\" is called a formula, a bit on that below\n",
    "m = ols('PRICE ~ RM',bos).fit()\n",
    "print m.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Interpreting coefficients\n",
    "\n",
    "There is a ton of information in this output. But we'll concentrate on the coefficient table (middle table). We can interpret the `RM` coefficient (9.1021) by first noticing that the p-vale (under `P>|t|`) is so small, basically zero. We can interpret the coefficient as, if we compare two groups of towns, one where the average number of rooms is say $5$ and the other group is the same except that they all have $6$ rooms. For these two groups the average difference in house prives is about $9.1$ (in thousands) so about $\\$9,100$ difference. The confidence interval fives us a range of plausible values for this difference, about ($\\$8,279, \\$9,925$), deffinitely not chump change. \n",
    "\n",
    "In the last section of this Lab we discuss p-values in more detail. Please have a read though it and ask your TFs for more help."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  `statsmodels` formulas\n",
    "***\n",
    "This formula notation will seem familiar to `R` users, but will take some getting used to for people coming from other languages or are new to statistics.\n",
    "\n",
    "The formula gives instruction for a general structure for a regression call. For `statsmodels` (`ols` or `logit`) calls you need to have a Pandas dataframe with column names that you will add to your formula. In the below example you need a pandas data frame that includes the columns named (`Outcome`, `X1`,`X2`, ...), bbut you don't need to build a new dataframe for every regression. Use the same dataframe with all these things in it. The structure is very simple:\n",
    "\n",
    "`Outcome ~ X1`\n",
    "\n",
    "But of course we want to to be able to handle more complex models, for example multiple regression is doone like this:\n",
    "\n",
    "`Outcome ~ X1 + X2 + X3`\n",
    "\n",
    "This is the very basic structure but it should be enough to get you through the homework. Things can get much more complex, for a quick run-down of further uses see the `statsmodels` [help page](http://statsmodels.sourceforge.net/devel/example_formulas.html).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how our model actually fit our data. We can see below that there is a ceiling effect, we should probably look into that. Also, for large values of $Y$ we get underpredictions, most predictions are below the 45-degree gridlines. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(bos['PRICE'], m.fittedvalues)\n",
    "plt.xlabel(\"Prices: $Y_i$\")\n",
    "plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n",
    "plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fitting Linear Regression using `sklearn`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "X = bos.drop('PRICE', axis = 1)\n",
    "\n",
    "# This creates a LinearRegression object\n",
    "lm = LinearRegression()\n",
    "lm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What can you do with a LinearRegression object? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Look inside linear regression object\n",
    "# LinearRegression.<tab>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Main functions | Description\n",
    "--- | --- \n",
    "`lm.fit()` | Fit a linear model\n",
    "`lm.predit()` | Predict Y using the linear model with estimated coefficients\n",
    "`lm.score()` | Returns the coefficient of determination (R^2). *A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What output can you get?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Look inside lm object\n",
    "# lm.<tab>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Output | Description\n",
    "--- | --- \n",
    "`lm.coef_` | Estimated coefficients\n",
    "`lm.intercept_` | Estimated intercept "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit a linear model\n",
    "***\n",
    "\n",
    "The `lm.fit()` function estimates the coefficients the linear regression using least squares. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Use all 13 predictors to fit linear regression model\n",
    "lm.fit(X, bos.PRICE)\n",
    "\n",
    "# your turn\n",
    "# notice fit_intercept=True and normalize=True\n",
    "# How would you change the model to not fit an intercept term? \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Estimated intercept and coefficients\n",
    "\n",
    "Let's look at the estimated coefficients from the linear model using `1m.intercept_` and `lm.coef_`.  \n",
    "\n",
    "After we have fit our linear regression model using the least squares method, we want to see what are the estimates of our coefficients $\\beta_0$, $\\beta_1$, ..., $\\beta_{13}$: \n",
    "\n",
    "$$ \\hat{\\beta}_0, \\hat{\\beta}_1, \\ldots, \\hat{\\beta}_{13} $$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print 'Estimated intercept coefficient:', lm.intercept_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print 'Number of coefficients:', len(lm.coef_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The coefficients\n",
    "pd.DataFrame(zip(X.columns, lm.coef_), columns = ['features', 'estimatedCoefficients'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict Prices \n",
    "\n",
    "We can calculate the predicted prices ($\\hat{Y}_i$) using `lm.predict`. \n",
    "\n",
    "$$ \\hat{Y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 X_1 + \\ldots \\hat{\\beta}_{13} X_{13} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# first five predicted prices\n",
    "lm.predict(X)[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.hist(lm.predict(X))\n",
    "plt.title('Predicted Housing Prices (fitted values): $\\hat{Y}_i$')\n",
    "plt.xlabel('Price')\n",
    "plt.ylabel('Frequency')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the true prices compared to the predicted prices to see they disagree, we saw this exactly befor but this is how you access the predicted values in using `sklearn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# plt.scatter(bos.PRICE, lm.predict(X))\n",
    "# plt.xlabel(\"Prices: $Y_i$\")\n",
    "# plt.ylabel(\"Predicted prices: $\\hat{Y}_i$\")\n",
    "# plt.title(\"Prices vs Predicted Prices: $Y_i$ vs $\\hat{Y}_i$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Residual sum of squares\n",
    "\n",
    "Let's calculate the residual sum of squares \n",
    "\n",
    "$$ S = \\sum_{i=1}^N r_i = \\sum_{i=1}^N (y_i - (\\beta_0 + \\beta_1 x_i))^2 $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print np.sum((bos.PRICE - lm.predict(X)) ** 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mean squared error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mseFull = np.mean((bos.PRICE - lm.predict(X)) ** 2)\n",
    "print mseFull"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Relationship between `PTRATIO` and housing price\n",
    "***\n",
    "\n",
    "Try fitting a linear regression model using only the 'PTRATIO' (pupil-teacher ratio by town)\n",
    "\n",
    "Calculate the mean squared error. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lm = LinearRegression()\n",
    "lm.fit(X[['PTRATIO']], bos.PRICE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']])) ** 2)\n",
    "print msePTRATIO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also plot the fitted linear regression line. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(bos.PTRATIO, bos.PRICE)\n",
    "plt.xlabel(\"Pupil-to-Teacher Ratio (PTRATIO)\")\n",
    "plt.ylabel(\"Housing Price\")\n",
    "plt.title(\"Relationship between PTRATIO and Price\")\n",
    "\n",
    "plt.plot(bos.PTRATIO, lm.predict(X[['PTRATIO']]), color='blue', linewidth=3)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Your turn\n",
    "***\n",
    "\n",
    "Try fitting a linear regression model using three independent variables\n",
    "\n",
    "1. 'CRIM' (per capita crime rate by town)\n",
    "2. 'RM' (average number of rooms per dwelling)\n",
    "3. 'PTRATIO' (pupil-teacher ratio by town)\n",
    "\n",
    "Calculate the mean squared error. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# your turn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Other important things to think about when fitting a linear regression model\n",
    "***\n",
    "<div class=\"span5 alert alert-danger\">\n",
    "<ul>\n",
    "  <li>**Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$. </li>\n",
    "  <li>**Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of X.  \n",
    "        <ul>\n",
    "            <li>e.g. PTRATIO\n",
    "        </ul>\n",
    "    </li>\n",
    "  <li> **Normal distribution for errors**.  The $\\epsilon$ term we discussed at the beginning are assumed to be normally distributed. \n",
    "  $$ \\epsilon_i \\sim N(0, \\sigma^2)$$\n",
    "Sometimes the distributions of responses $Y$ may not be normally distributed at any given value of $X$.  e.g. skewed positively or negatively. </li>\n",
    "<li> **Independent errors**.  The observations are assumed to be obtained independently.\n",
    "    <ul>\n",
    "        <li>e.g. Observations across time may be correlated\n",
    "    </ul>\n",
    "</li>\n",
    "</ul>  \n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3: Training and Test Data sets\n",
    "\n",
    "### Purpose of splitting data into Training/testing sets\n",
    "***\n",
    "<div class=\"span5 alert alert-info\">\n",
    "\n",
    "<p> Let's stick to the linear regression example: </p>\n",
    "<ul>\n",
    "  <li> We built our model with the requirement that the model fit the data well. </li>\n",
    "  <li> As a side-effect, the model will fit <b>THIS</b> dataset well. What about new data? </li>\n",
    "    <ul>\n",
    "      <li> We wanted the model for predictions, right?</li>\n",
    "    </ul>\n",
    "  <li> One simple solution, leave out some data (for <b>testing</b>) and <b>train</b> the model on the rest </li>\n",
    "  <li> This also leads directly to the idea of cross-validation, next section. </li>  \n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "***\n",
    "\n",
    "One way of doing this is you can create training and testing data sets manually. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = X[:-50]\n",
    "X_test = X[-50:]\n",
    "Y_train = bos.PRICE[:-50]\n",
    "Y_test = bos.PRICE[-50:]\n",
    "print X_train.shape\n",
    "print X_test.shape\n",
    "print Y_train.shape\n",
    "print Y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way, is to split the data into random train and test subsets using the function `train_test_split` in `sklearn.cross_validation`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's look at the function in the help file\n",
    "# sklearn.cross_validation.train_test_split?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(\n",
    "    X, bos.PRICE, test_size=0.33, random_state = 5)\n",
    "print X_train.shape\n",
    "print X_test.shape\n",
    "print Y_train.shape\n",
    "print Y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your turn.  Let's build a linear regression model using our new training data sets. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# your turn\n",
    "lm = LinearRegression()\n",
    "lm.fit(X_train, Y_train)\n",
    "pred_train = lm.predict(X_train)\n",
    "pred_test = lm.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, calculate the mean squared error using just the test data and compare to mean squared from using all the data to fit the model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# your turn\n",
    "print \"Fit a model X_train, and calculate MSE with Y_train:\", np.mean((Y_train - lm.predict(X_train)) ** 2)\n",
    "print \"Fit a model X_train, and calculate MSE with X_test, Y_test:\", np.mean((Y_test - lm.predict(X_test)) ** 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Residual plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)\n",
    "plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)\n",
    "plt.hlines(y = 0, xmin=0, xmax = 50)\n",
    "plt.title('Residual Plot using training (blue) and test (green) data')\n",
    "plt.ylabel('Residuals')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-fold Cross-validation as an extension of this idea\n",
    "***\n",
    "<div class=\"span5 alert alert-info\">\n",
    "\n",
    "<p> A simple extension of the Test/train split is called K-fold cross-validation.  </p>\n",
    "\n",
    "<p> Here's the procedure:</p>\n",
    "<ul>\n",
    "  <li> randomly assign your $n$ samples to one of $K$ groups. They'll each have about $n/k$ samples</li>\n",
    "  <li> For each group $k$: </li>\n",
    "    <ul>\n",
    "      <li> Fit the model (e.g. run regression) on all data excluding the $k^{th}$ group</li>\n",
    "      <li> Use the model to predict the outcomes in group $k$</li>\n",
    "      <li> Calculate your prediction error for each observation in $k^{th}$ group (e.g. $(Y_i - \\hat{Y}_i)^2$ for regression, $\\mathbb{1}(Y_i = \\hat{Y}_i)$ for logistic regression). </li>\n",
    "    </ul>\n",
    "  <li> Calculate the average prediction error across all samples $Err_{CV} = \\frac{1}{n}\\sum_{i=1}^n (Y_i - \\hat{Y}_i)^2$ </li>\n",
    "</ul>\n",
    "</div>\n",
    "\n",
    "***\n",
    "\n",
    "Luckily you don't have to do this entire process all by hand (``for`` loops, etc.) every single time, ``sci-kit learn`` has a very nice implementation of this, have a look at the [documentation](http://scikit-learn.org/stable/modules/cross_validation.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Another Example: Old Faithful Geyser Data Set\n",
    "***\n",
    "\n",
    "The [Old Faithful Geyser](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/faithful.html) data set is a well-known data set that depicts the relationship of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA [[webcam]](http://yellowstone.net/webcams/). This data set is found in the base installation of the [R programming language](http://cran.r-project.org).  \n",
    "\n",
    "`faithful` is a data set with 272 observations on 2 variables.\n",
    "\n",
    "Column name| Description \n",
    "--- | --- \n",
    "eruptions | Eruption time (in mins)\n",
    "waiting\t| Waiting time to next eruption (in mins)\n",
    "\n",
    "There is a function in `statsmodels` (or `sm` for short) called `sm.datasets.get_rdataset` which will download and return a data set found in [R](http://cran.r-project.org).  \n",
    "\n",
    "Let's import the `faithful` dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "faithful = sm.datasets.get_rdataset(\"faithful\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's look at the help file\n",
    "sm.datasets.get_rdataset?\n",
    "faithful?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "faithful.title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "faithful = faithful.data\n",
    "faithful.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "faithful.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Histogram \n",
    "***\n",
    "\n",
    "Create a histogram of the time between eruptions. What do you see? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.hist(faithful.waiting)\n",
    "plt.xlabel('Waiting time to next eruption (in mins)')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title('Old Faithful Geyser time between eruption')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This histogram indicates [Old Faithful isn’t as “faithful” as you might think](http://people.stern.nyu.edu/jsimonof/classes/2301/pdf/geystime.pdf). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter plot \n",
    "***\n",
    "\n",
    "Create a scatter plot of the `waiting` on the x-axis and the `eruptions` on the y-axis. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(faithful.waiting, faithful.eruptions)\n",
    "plt.xlabel('Waiting time to next eruption (in mins)')\n",
    "plt.ylabel('Eruption time (in mins)')\n",
    "plt.title('Old Faithful Geyser')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build a linear regression to predict eruption time using `statsmodels`\n",
    "***\n",
    "\n",
    "Now let's build a linear regression model for the `faithful` DataFrame, and estimate the next eruption duration if the waiting time since the last eruption has been 75 minutes.\n",
    "\n",
    "$$ Eruptions = \\beta_0 + \\beta_1 * Waiting + \\epsilon $$ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = faithful.waiting\n",
    "y = faithful.eruptions\n",
    "model = sm.OLS(y, X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's look at the options in model\n",
    "# model.<tab>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "results = model.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's look at the options in results\n",
    "# results.<tab>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print results.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "results.params.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice, there is no intercept ($\\beta_0$) fit in this linear model.  To add it, we can use the function `sm.add_constant`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = sm.add_constant(X)\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's fit a linear regression model with an intercept. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "modelW0 = sm.OLS(y, X)\n",
    "resultsW0 = modelW0.fit()\n",
    "print resultsW0.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to predict the time to the next eruption using a waiting time of 75, you can directly estimate this using the equation \n",
    "\n",
    "$$ \\hat{y} = \\hat{\\beta}_0 + \\hat{\\beta}_1 * 75 $$ \n",
    "\n",
    "or you can use `results.predict`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "newX = np.array([1,75])\n",
    "resultsW0.params[0]*newX[0] + resultsW0.params[1] * newX[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "resultsW0.predict(newX)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on this linear regression, if the waiting time since the last eruption has been 75 minutes, we expect the next one to last approximately 3.80 minutes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot the regression line \n",
    "***\n",
    "\n",
    "Instead of using `resultsW0.predict(X)`, we can use `resultsW0.fittedvalues` which are the $\\hat{y}$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(faithful.waiting, faithful.eruptions)\n",
    "plt.xlabel('Waiting time to next eruption (in mins)')\n",
    "plt.ylabel('Eruption time (in mins)')\n",
    "plt.title('Old Faithful Geyser')\n",
    "\n",
    "plt.plot(faithful.waiting, resultsW0.fittedvalues, color='blue', linewidth=3)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Residuals, residual sum of squares, mean squared error\n",
    "***\n",
    "\n",
    "Recall, we can directly calculate the residuals as \n",
    "\n",
    "$$r_i = y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i)$$\n",
    "\n",
    "To calculate the residual sum of squares, \n",
    "\n",
    "$$ S = \\sum_{i=1}^n r_i = \\sum_{i=1}^n (y_i - (\\hat{\\beta}_0 + \\hat{\\beta}_1 x_i))^2 $$\n",
    "\n",
    "where $n$ is the number of observations.  Alternatively, we can simply ask for the residuals using `resultsW0.predict`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "resids = faithful.eruptions - resultsW0.predict(X)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "resids = resultsW0.resid\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(faithful.waiting, resids, 'o')\n",
    "plt.hlines(y = 0, xmin=40, xmax = 100)\n",
    "plt.xlabel('Waiting time')\n",
    "plt.ylabel('Residuals')\n",
    "plt.title('Residual Plot')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The residual sum of squares: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print np.sum((faithful.eruptions - resultsW0.predict(X)) ** 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mean squared error: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print np.mean((faithful.eruptions - resultsW0.predict(X)) ** 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a linear regression to predict eruption time using least squares \n",
    "***\n",
    "\n",
    "Now let's build a linear regression model for the `faithful` DataFrame, but instead of using `statmodels` (or `sklearn`), let's use the least squares estimates of the coefficients for the linear regression model.\n",
    "\n",
    "$$ \\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $$ \n",
    "\n",
    "The `numpy` function [`np.dot`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot) is the dot product (or inner product) of two vectors (or arrays in python).  \n",
    "\n",
    "The `numpy` function [`np.linalg.inv`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv) can be used to compute the inverse of a matrix. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = sm.add_constant(faithful.waiting)\n",
    "y = faithful.eruptions\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, compute $X^{\\top}X$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np.dot(X.T, X)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, compute the inverse of $X^{\\top}X$ or $(X^{\\top}X)^{-1}$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np.linalg.inv(np.dot(X.T, X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, compute $\\hat{\\beta} = (X^{\\top}X)^{-1} X^{\\top}Y $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "beta = np.linalg.inv(np.dot(X.T, X)).dot(X.T).dot(y)\n",
    "print \"Directly estimating beta:\", beta\n",
    "print \"Estimating beta using statmodels: \", resultsW0.params.values\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 4: Many different types of regression\n",
    "***\n",
    "<div class=\"span5 alert alert-info\">\n",
    "\n",
    "<p>You do not always have a continuous $y$ variable that you are measuring. Sometimes it may be binary (e.g. 0 or 1). Sometimes it may be count data.  What do you do?</p>\n",
    "\n",
    "<p>Use other types of regression besides just simple linear regression.</p>  \n",
    "\n",
    "<p>[Nice summary of several types of regression](http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use). </p>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 5: Logistic Regression\n",
    "***\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"span5 alert alert-info\">\n",
    "<p>Logistic regression is a probabilistic model that links observed binary data to a set of features.</p>\n",
    "\n",
    "<p>Suppose that we have a set of binary (that is, taking the values 0 or 1) observations $Y_1,\\cdots,Y_n$, and for each observation $Y_i$ we have a vector of features $X_i$. The logistic regression model assumes that there is some set of **weights**, **coefficients**, or **parameters** $\\beta$, one for each feature, so that the data were generated by flipping a weighted coin whose probability of giving a 1 is given by the following equation:\n",
    "\n",
    "$$\n",
    "P(Y_i = 1) = \\mathrm{logistic}(\\sum \\beta_i X_i),\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "$$\n",
    "\\mathrm{logistic}(x) = \\frac{e^x}{1+e^x}.\n",
    "$$\n",
    "</p>\n",
    "<p>When we *fit* a logistic regression model, we determine values for each $\\beta$ that allows the model to best fit the *training data* we have observed. Once we do this, we can use these coefficients to make predictions about data we have not yet observed.</p>\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From http://www.edwardtufte.com/tufte/ebooks, in \"Visual and Statistical Thinking: \n",
    "Displays of Evidence for Making Decisions\":\n",
    "\n",
    ">On January 28, 1986, the space shuttle Challenger exploded and seven astronauts died because two rubber O-rings leaked. These rings had lost their resiliency because the shuttle was launched on a very cold day. Ambient temperatures were in the low 30s and the O-rings themselves were much colder, less than 20F.\n",
    "\n",
    ">One day before the flight, the predicted temperature for the launch was 26F to 29F. Concerned that the rings would not seal at such a cold temperature, the engineers who designed the rocket opposed launching Challenger the next day.\n",
    "\n",
    "But they did not make their case persuasively, and were over-ruled by NASA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from IPython.display import Image as Im\n",
    "from IPython.display import display\n",
    "Im('./images/shuttle.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The image above shows the leak, where the O-ring failed.\n",
    "\n",
    "We have here data on previous failures of the O-rings at various temperatures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data=np.array([[float(j) for j in e.strip().split()] for e in open(\"./data/chall.txt\")])\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets plot this data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# fit logistic regression model\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.formula.api import logit, glm, ols\n",
    "\n",
    "# statsmodels works nicely with pandas dataframes\n",
    "dat = pd.DataFrame(data, columns = ['Temperature', 'Failure'])\n",
    "logit_model = logit('Failure ~ Temperature',dat).fit()\n",
    "print logit_model.summary()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Interpreting p-values:\n",
    "Generally we'd like the p-values to be very small as they represent the probability that we observed such an strong relationship between temperature and O-ring failures purely by chance. So when the p-value is small (we usually consider \"small\" as less than 0.05), what we're saying is that based on the data we observed, we know **fairly certainly** that temperature is strongly associated with the failure of O-rings. This is a very powerful statement that can take us a long way in terms of learning if used properly. Have a look at the Wikipedia page on [p-values](https://en.wikipedia.org/wiki/P-value) for a quick reminder.\n",
    "\n",
    "There are some issues with testing many many hypotheses that we'll also encounter in the homework. But generally, the idea is that the data may (or may not) have information about things you're interested in. We ask the data questions through hypotheses, but the more questions we ask of it, the higher chances we have of the data actually showing associations at random. Have alook at [this article](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) to get an idea of the problem and some solutions. We'll be considering a very crude solution known as the [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) but that is by no means the best solution. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# calculate predicted failure probabilities for new termperatures\n",
    "x = np.linspace(50, 85, 1000)\n",
    "p = logit_model.params\n",
    "eta = p['Intercept'] + x*p['Temperature']\n",
    "y = np.exp(eta)/(1 + np.exp(eta))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the data along with a range of predicted failure probabilities for unobserved temperatures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# plot data\n",
    "temps, pfail = data[:,0], data[:,1]\n",
    "plt.scatter(temps, pfail)\n",
    "axes=plt.gca()\n",
    "plt.xlabel('Temperature')\n",
    "plt.ylabel('Failure')\n",
    "plt.title('O-ring failures')\n",
    "\n",
    "# plot fitted values\n",
    "plt.plot(x, y)\n",
    "\n",
    "# change limits, for a nicer plot\n",
    "plt.xlim(50, 85)\n",
    "plt.ylim(-0.1, 1.1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can interpret the output from a logistic regression by looking at the coefficient of temperature (as well as the p-value). Since the coefficient of temperature is negative, we can say that an increase in temperature is associated with a decrease in the odds of having an O-ring failure. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
